Goto

Collaborating Authors

 policy vector


ARelated Work

Neural Information Processing Systems

Transfer in reinforcement learning aims at solving a new target task with no additional learning or sample-efficiently by exploiting agents and information obtained from source tasks. We review a line of research with relevant approaches. This group of approaches reuses policies learned on source tasks for target tasks. Fernández and Veloso [17] suggest an exploration strategy for the learning of a new policy given a new task and learned source policies, where the gain of using each policy is estimated together on-line and one of the policies in the set is selected probabilistically at each step, based on the gain, but they focus on aiding the training of the target policy with samples from the target task rather than improving the zero-shot transfer performance. On the other hand, Dayan [14] introduce successor representations (SRs), state space occupancy representations disentangled from rewards, which allow linear decomposition of value functions.


ARelatedWork

Neural Information Processing Systems

This group of approaches reuses policies learned on source tasks for target tasks. There is a series of studies that directly exploits the smoothness ofoptimal valuesacross taskswithfunction approximators. Figure 9: The performance profiles [2, 15] of the inference with GPI and constrained GPI on Reacher. For its use in the zero-shot transfer problem, we first set four fixed goal locations at (0.1,0.0),(0.0,0.1),( Our first observation is that while the transferred agents perform comparably on some tasks, constrained GPI makes significant differences on the others, especially more on the "Harsh" target tasks with many 1's as elements in their task vectors.


Learning to Plan via a Multi-Step Policy Regression Method

arXiv.org Artificial Intelligence

We propose a new approach to increase inference performance in environments that require a specific sequence of actions in order to be solved. This is for example the case for maze environments where ideally an optimal path is determined. Instead of learning a policy for a single step, we want to learn a policy that can predict n actions in advance. Our proposed method called policy horizon regression (PHR) uses knowledge of the environment sampled by A2C to learn an n dimensional policy vector in a policy distillation setup which yields n sequential actions per observation. We test our method on the MiniGrid and Pong environments and show drastic speedup during inference time by successfully predicting sequences of actions on a single observation.


Efficient Decoupled Neural Architecture Search by Structure and Operation Sampling

arXiv.org Machine Learning

We propose a novel neural architecture search algorithm via reinforcement learning by decoupling structure and operation search processes. Our approach samples candidate models from the multinomial distribution on the policy vectors defined on the two search spaces independently. The proposed technique improves the efficiency of architecture search process significantly compared to the conventional methods based on reinforcement learning with the RNN controllers while achieving competitive accuracy and model size in target tasks. Our policy vectors are easily interpretable throughout the training procedure, which allows to analyze the search progress and the discovered architectures; the black-box characteristics of the RNN controllers hamper understanding training progress in terms of policy parameter updates. Our experiments demonstrate outstanding performance compared to the state-of-the-art methods with a fraction of search cost.